Russo et al. (2018)
2025-11-25
Agent repeatedly chooses actions and receives stochastic rewards
Must balance exploitation (choose current best) & exploration (learn about uncertain options)
Bernoulli bandit as example: reward ∈ {0,1} with unknown success probability θₖ
Goal: maximize cumulative reward (minimize regret)
Regret = loss from not always selecting the optimal arm
Link to Robbins (1952): early formulation of the exploration–exploitation dilemma
Illustration of uncertainty over action rewards via posterior distributions:
→ motivates exploration
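The setup above can be made concrete with a tiny simulation (hypothetical θ values; uniform-random play as a naive baseline). Expected regret of random play here is ≈ T·(0.7 − 0.5) = 200; a good policy drives per-round regret toward 0.

```python
import numpy as np

rng = np.random.default_rng(0)
theta = np.array([0.3, 0.5, 0.7])   # hypothetical success probabilities, unknown to the agent
T = 1000

# Naive baseline: pull a uniformly random arm each round.
arms = rng.integers(0, len(theta), size=T)
rewards = rng.random(T) < theta[arms]

# Regret = reward of always playing the best arm, minus what was actually earned.
regret = T * theta.max() - rewards.sum()
```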
Why Greedy Fails
Selects arm with highest current estimate → gets stuck if early samples are unlucky
May never discover the best arm, even with infinite time
Russo’s 3-arm example: greedy keeps exploiting arm 1 forever
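A minimal sketch of that failure mode (not Russo's exact example): a hypothetical two-arm instance where we imagine one unlucky first pull on the truly better arm.

```python
import numpy as np

rng = np.random.default_rng(0)
theta = [0.7, 0.8]                 # hypothetical: arm 1 is truly better
# Suppose each arm was pulled once, and arm 1's pull happened to fail:
counts, successes = [1, 1], [1, 0]

for t in range(1000):
    means = [successes[k] / counts[k] for k in range(2)]
    k = int(np.argmax(means))      # greedy: pick the highest empirical mean
    r = int(rng.random() < theta[k])
    counts[k] += 1
    successes[k] += r
```

Because arm 1's empirical mean is stuck at 0 and arm 0's mean can never fall to 0 (its first pull succeeded), the greedy agent never revisits the better arm, no matter how long it runs.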
Why ε-greedy is still not ideal
Exploration is random, not targeted → wastes samples
Treats all uncertainty equally (uninformative exploration)
Performs poorly in large action spaces
Goal: exploration driven by uncertainty, not randomness.
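For contrast, a minimal ε-greedy sketch (ε and θ values hypothetical), showing why its exploration is undirected: the exploring branch picks any arm uniformly, however bad or well-understood.

```python
import numpy as np

def eps_greedy(theta, T=5000, eps=0.1, seed=0):
    """Minimal ε-greedy Bernoulli bandit: explore uniformly with probability ε."""
    rng = np.random.default_rng(seed)
    K = len(theta)
    counts, successes = np.zeros(K), np.zeros(K)
    total = 0
    for t in range(T):
        if rng.random() < eps:
            k = int(rng.integers(K))   # undirected: any arm, regardless of uncertainty
        else:
            # Empirical means; untried arms default optimistically to 1.0.
            means = np.divide(successes, counts, out=np.ones(K), where=counts > 0)
            k = int(np.argmax(means))
        r = int(rng.random() < theta[k])
        counts[k] += 1
        successes[k] += r
        total += r
    return total
```

Even after the best arm is obvious, ε-greedy keeps spending about an ε fraction of its pulls at random, which is exactly the wasted, uncertainty-blind exploration noted above.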
Why TS is appealing
Randomized, but driven by uncertainty (not noise)
Automatically balances exploration vs. exploitation
Higher posterior uncertainty → higher probability of sampling
Intuition: “Act as if one plausible world is true, then learn if you’re wrong.”
Pseudocode
Maintain a posterior belief over θ
Sample a parameter θ̃ ~ p(θ | data)
Choose the action optimal under θ̃
Observe reward & update the posterior
Repeat
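The loop above, instantiated for the Bernoulli bandit with conjugate Beta(1, 1) priors (a minimal sketch; the arm probabilities are hypothetical):

```python
import numpy as np

def thompson_bernoulli(theta, T=5000, seed=0):
    """Thompson Sampling with a Beta(1, 1) prior on each arm's θₖ."""
    rng = np.random.default_rng(seed)
    K = len(theta)
    alpha, beta = np.ones(K), np.ones(K)   # posterior Beta(αₖ, βₖ) parameters
    pulls = np.zeros(K, dtype=int)
    for t in range(T):
        sampled = rng.beta(alpha, beta)    # sample one plausible world θ̃
        k = int(np.argmax(sampled))        # act optimally under θ̃
        r = int(rng.random() < theta[k])   # observe Bernoulli reward
        alpha[k] += r                      # conjugate posterior update
        beta[k] += 1 - r
        pulls[k] += 1
    return pulls

pulls = thompson_bernoulli([0.2, 0.5, 0.8])
```

As the posteriors concentrate, suboptimal arms are sampled highest less and less often, so exploration fades out on its own.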
Greedy uses mean estimate θ̂ₖ = αₖ / (αₖ + βₖ) → picks best guess → no exploration
TS samples θ̃ₖ ~ Beta(αₖ, βₖ) → stochastic exploration
Posterior updates are identical → only difference is sampling
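That single difference can be shown side by side (hypothetical posterior parameters):

```python
import numpy as np

rng = np.random.default_rng(0)
alpha = np.array([5.0, 2.0])   # hypothetical Beta posterior parameters
beta_ = np.array([5.0, 1.0])

k_greedy = int(np.argmax(alpha / (alpha + beta_)))   # point estimate θ̂ₖ = αₖ/(αₖ+βₖ)
k_ts = int(np.argmax(rng.beta(alpha, beta_)))        # posterior sample θ̃ₖ ~ Beta(αₖ, βₖ)

# Greedy always picks arm 1 here (mean 2/3 > 1/2); TS sometimes picks arm 0,
# with probability reflecting how uncertain the posteriors still are.
```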
TS's regret can be analyzed via the information ratio
Leads to regret bounds comparable to UCB
Simple, scalable, Bayesian
Works in many domains (online ads, recommendations, clinical trials)
Russo’s main result: logarithmic regret under broad conditions (no equations needed)
Under what conditions does TS underperform & why?
How sensitive is TS to incorrect or uninformative priors?
How should TS be adapted for drifting or nonstationary rewards?
Can TS be extended reliably to contextual bandits & RL settings?
In what scenarios is UCB a better choice than TS?
“Explore when unsure, exploit when sure.”